AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers with a higher probability of purchasing the loan.
The goal is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target more.
ID: Customer ID
Age: Customer's age in completed years
Experience: Years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIP Code: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
To complete this exploratory analysis and build a model to identify potential customers with a higher probability of purchasing the loan, the following domain knowledge will be helpful:
1) Banking and Finance
2) Marketing and Customer Segmentation
3) Evaluation Metrics
4) Ethics and Compliance
Based on ZIP codes in the dataset:
Official website for the State of California
https://dfpi.ca.gov/consumer-financial-education-other-loans/
Personal Loans: One of the most attractive things about personal loans is that they can be used for any reason. Personal loans may be an option for people with credit card debt who want to reduce their interest rate by transferring balances. Like other loans, the interest rate and loan terms depend on your credit history and financial situation. The term of a personal loan is generally between 12 and 60 months, the amount can range from as little as $1,000 to $100,000 or more, and the APR may range from 6% to 36%. It is important to consider multiple lenders and negotiate the best terms for your situation.
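Those quoted terms imply a wide range of monthly payments. As a purely illustrative sketch (the figures below are hypothetical, not from the dataset), the standard fixed-payment amortization formula looks like this:

```python
def monthly_payment(principal, apr, months):
    """Fixed monthly payment for a fully amortizing loan."""
    r = apr / 12  # monthly interest rate
    if r == 0:
        return principal / months
    return principal * r / (1 - (1 + r) ** -months)

# Example: a $10,000 loan at 12% APR over 36 months
payment = monthly_payment(10_000, 0.12, 36)
print(round(payment, 2))  # 332.14
```

Shorter terms and lower APRs reduce total interest paid, which is why comparing lenders matters.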
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build K-Means Clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy import stats
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To perform statistical analysis
import scipy.stats as stats
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    make_scorer,
)
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Initial load
data = pd.read_csv("/content/Loan_Modelling.csv")
# Copy data to another variable to preserve original
loan = data.copy()
# First 5 rows
loan.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# Last 5 rows
loan.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
# Shape of dataset
loan.shape
(5000, 14)
loan.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
All columns are numeric dtypes (13 int64, 1 float64)
No missing values
Small memory footprint (~547 KB)
# Statistical Summary
loan.describe()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 |
| mean | 2500.500000 | 45.338400 | 20.104600 | 73.774200 | 93169.257000 | 2.396400 | 1.937938 | 1.881000 | 56.498800 | 0.096000 | 0.104400 | 0.06040 | 0.596800 | 0.294000 |
| std | 1443.520003 | 11.463166 | 11.467954 | 46.033729 | 1759.455086 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 1.000000 | 23.000000 | -3.000000 | 8.000000 | 90005.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 25% | 1250.750000 | 35.000000 | 10.000000 | 39.000000 | 91911.000000 | 1.000000 | 0.700000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 50% | 2500.500000 | 45.000000 | 20.000000 | 64.000000 | 93437.000000 | 2.000000 | 1.500000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 |
| 75% | 3750.250000 | 55.000000 | 30.000000 | 98.000000 | 94608.000000 | 3.000000 | 2.500000 | 3.000000 | 101.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.000000 |
| max | 5000.000000 | 67.000000 | 43.000000 | 224.000000 | 96651.000000 | 4.000000 | 10.000000 | 3.000000 | 635.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 |
All columns have a count of 5000, so there are no missing values.
Age has a mean of 45 and a standard deviation of about 11.5; the minimum age is 23 and the maximum is 67.
Experience has a mean of 20 and a standard deviation of 11.5. The minimum is -3 and the maximum is 43 years; negative years of experience are not possible, so these values will need to be treated.
Income has a mean of 74K and a standard deviation of 46K. Values range from 8K to 224K.
ZIP codes will be analyzed further.
There are 4 unique values in the Family column.
CCAvg has a mean of 1.94 and a standard deviation of 1.7. Values range from 0.0 to 10.0.
The Education column has 3 unique values.
Mortgage has a mean of 56.5K and a standard deviation of 101K. The standard deviation exceeding the mean suggests a highly skewed distribution, which we will investigate further.
Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard will be analyzed further.
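The negative Experience values flagged above can be handled in a couple of common ways. A minimal sketch on a hypothetical mini-frame (not the real data) showing both options:

```python
import pandas as pd

# Hypothetical mini-frame mirroring the issue: Experience has a minimum of -3
df = pd.DataFrame({"Experience": [-3, -1, 0, 5, 20]})

# Option 1: treat negatives as sign errors and take the absolute value
abs_fix = df["Experience"].abs()

# Option 2: treat negatives as impossible and clip them to zero
clip_fix = df["Experience"].clip(lower=0)

print(abs_fix.tolist())   # [3, 1, 0, 5, 20]
print(clip_fix.tolist())  # [0, 0, 0, 5, 20]
```

Which option is right depends on whether -3 is believed to be a typo for 3 or simply bad data.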
# Check null values
loan.isnull().sum()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
No missing values present
# Check for duplicates
loan.duplicated().sum()
0
No duplicates present
Questions:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, ascending=False):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    ascending: whether to order the bars by ascending count (default is False)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts(ascending=ascending).index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar center
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage

    plt.show()  # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# Show first few rows as a reminder
loan.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# review ID column
loan['ID'].value_counts()
ID
1 1
3331 1
3338 1
3337 1
3336 1
..
1667 1
1666 1
1665 1
1664 1
5000 1
Name: count, Length: 5000, dtype: int64
loan['ID'].value_counts().sum()
5000
ID column will not be further analyzed as each ID is unique.
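Since every ID is unique, it carries no predictive signal and can be verified and dropped before modeling. A quick sketch on a tiny stand-in frame (the same two lines apply to the full `loan` DataFrame):

```python
import pandas as pd

# Tiny stand-in frame; ID is a unique row identifier
df = pd.DataFrame({"ID": [1, 2, 3], "Age": [25, 45, 39]})

assert df["ID"].is_unique        # confirm each ID occurs exactly once
df = df.drop(columns=["ID"])     # drop it so it cannot leak into a model
print(list(df.columns))          # ['Age']
```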
# Review Zipcode column
loan['ZIPCode'].value_counts()
ZIPCode
94720 169
94305 127
95616 116
90095 71
93106 57
...
96145 1
94087 1
91024 1
93077 1
94598 1
Name: count, Length: 467, dtype: int64
Further Insight
1 Age
2 Experience
3 Income
4 ZIPCode
5 Family
6 CCAvg
7 Education
8 Mortgage
9 Personal_Loan
10 Securities_Account
11 CD_Account
12 Online
13 CreditCard
# Histplot & Boxplot
histogram_boxplot(loan, "Age")
# Breakdown of Age
labeled_barplot(loan, "Age", perc=True, n=None, ascending = True)
Further insight
# Histplot & Boxplot
histogram_boxplot(loan, "Experience")
# Breakdown of experience
labeled_barplot(loan, "Experience", perc=True, n=None, ascending = True)
#Max value
loan['Experience'].max()
43
#Min value
loan['Experience'].min()
-3
Further insight
# Histplot & Boxplot
histogram_boxplot(loan, "Income")
loan['Income'].max()
224
loan['Income'].min()
8
loan['Income'].median()
64.0
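The mean (about 74) exceeding the median (64) points to a right-skewed income distribution, and pandas can quantify that directly. A sketch on a hypothetical right-skewed sample (values in thousands, echoing Income's shape, not drawn from the dataset):

```python
import pandas as pd

# Hypothetical right-skewed sample in thousands of dollars
income = pd.Series([8, 20, 35, 39, 59, 64, 84, 98, 150, 224])

print(income.mean() > income.median())  # True -> mean pulled up by the long tail
print(income.skew() > 0)                # True -> positive (right) skew
```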
# Check recurring zip codes
loan['ZIPCode'].nunique()
467
# Highest count for zip codes
loan['ZIPCode'].value_counts()
ZIPCode
94720 169
94305 127
95616 116
90095 71
93106 57
...
96145 1
94087 1
91024 1
93077 1
94598 1
Name: count, Length: 467, dtype: int64
#Count occurrences of each zip code
zipcode_counts = loan['ZIPCode'].value_counts()
#Select the top 5 zip codes
top_5_zipcodes = zipcode_counts.head(5)
# Convert to DataFrame for easier plotting
top_5 = top_5_zipcodes.reset_index()
top_5.columns = ['ZIPCode', 'count']
# Plot the bar chart
sns.barplot(data=top_5, x='ZIPCode', y='count')
plt.xlabel('Zipcode')
plt.ylabel('Count')
plt.title('Top 5 Zipcodes by Count')
plt.show()
All ZIP codes correspond to locations within California, with 94720, 94305, and 95616 having the highest counts.
This could reflect the locations of AllLife Bank branches, possibly sited in higher-average-income areas, resulting in these ZIP codes appearing more often.
(Further bivariate analysis of income and ZIP codes will give better insight.)
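A coarser regional view can be obtained by grouping ZIP codes on their leading digits (California ZIP codes fall roughly in the 90xxx-96xxx range). A sketch on a few sample codes from the dataset's head:

```python
import pandas as pd

# A handful of ZIP codes from the first rows of the dataset
zips = pd.Series([91107, 90089, 94720, 94112, 91330, 95616])

# The first two digits give a coarse regional bucket
prefix = zips.astype(str).str[:2]
counts = prefix.value_counts()
print(counts["94"], counts["91"], counts["90"], counts["95"])  # 2 2 1 1
```

Applied to the full column, this collapses 467 distinct ZIP codes into a handful of regions that are easier to compare.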
# Histplot & Boxplot
histogram_boxplot(loan, "Family")
# Barplot with percent
labeled_barplot(loan, "Family", perc=True, n=None, ascending = True)
# Histplot & Boxplot
histogram_boxplot(loan, "CCAvg")
# Max value
loan['CCAvg'].max()
10.0
# Histplot & Boxplot
histogram_boxplot(loan, "Education")
# Barplot
labeled_barplot(loan, 'Education', perc =True)
Undergrad = 1 / Graduate = 2 / Advanced/Professional = 3
# Histplot & Boxplot
histogram_boxplot(loan, "Mortgage")
# Total count
loan['Mortgage'].value_counts()
Mortgage
0 3462
98 17
119 16
89 16
91 16
...
547 1
458 1
505 1
361 1
541 1
Name: count, Length: 347, dtype: int64
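The value counts above show 3462 customers with Mortgage equal to 0, which explains the large spike at zero in the histogram. Their share of the 5000 customers is easy to confirm:

```python
# 3462 of 5000 customers carry no mortgage (from the value_counts above)
zero_share = 3462 / 5000
print(f"{zero_share:.1%}")  # 69.2%
```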
# Histplot & Boxplot
histogram_boxplot(loan, "Personal_Loan")
# Barplot
labeled_barplot(loan, 'Personal_Loan', perc =True)
Did this customer accept the personal loan offered in the last campaign?
Further analysis
# Histplot & Boxplot
histogram_boxplot(loan, "Securities_Account")
# Barplot
labeled_barplot(loan, 'Securities_Account', perc =True)
The majority of customers (89.6%) do not have a securities account with the bank.
Personal_Loan and Securities_Account have similar overall yes/no percentages.
# Histplot & Boxplot
histogram_boxplot(loan, "CD_Account")
# Barplot
labeled_barplot(loan, 'CD_Account', perc =True)
The majority of customers (94%) do not have a CD account with the bank.
Only a small minority (6%) do.
# Histplot & Boxplot
histogram_boxplot(loan, "Online")
# Barplot
labeled_barplot(loan, 'Online', perc =True)
# Histplot & Boxplot
histogram_boxplot(loan, "CreditCard")
# Barplot
labeled_barplot(loan, 'CreditCard', perc =True)
# Pull up data for reminder
loan.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# Heatmap
plt.figure(figsize=(12, 7))
sns.heatmap(loan.corr(), annot=True, cmap="coolwarm");
Age and Experience have a strong positive correlation: as customers get older, their work experience grows.
CCAvg and Income have a moderate positive correlation: the higher the income, the higher the average credit card spending.
Personal_Loan and Income have a moderate positive correlation: customers are more likely to take out personal loans as their income increases.
#Plot pairplot
sns.pairplot(data=loan, diag_kind="kde")
plt.show();
Age and experience have a positive correlation. The higher the age, the more experience a customer has.
No other strong correlations are apparent in the pairplot.
# Countplot - Personal_ loan & Family
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Family')
plt.ylabel('Count')
plt.title('Personal loan & Family')
plt.xticks(rotation=45);
#Figure size
plt.figure(figsize=(10, 6))
#Boxplot
sns.boxplot(data=loan, x='Personal_Loan', y='Family')
plt.ylabel('Family')
plt.title('Personal Loan & Family')
plt.xticks(rotation=45);
Individuals who take personal loans tend to have larger families on average than those who do not.
Without personal loan
With personal loan
# Checking the distribution
loan.groupby('Personal_Loan')['Education'].describe()
| Personal_Loan | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 4520.0 | 1.843584 | 0.839975 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| 1 | 480.0 | 2.233333 | 0.753373 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 |
# Countplot - Personal Loan & Education
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Education')
plt.ylabel('Count')
plt.title('Personal loan & Education')
plt.xticks(rotation=45);
#Figure size
plt.figure(figsize=(10, 6))
#Boxplot
sns.boxplot(data=loan, x='Personal_Loan', y='Education')
plt.ylabel('Education')
plt.title('Personal Loan & Education')
plt.xticks(rotation=45);
Undergrad = 1 / Graduate = 2 / Advanced/Professional = 3
Without personal loan
With personal loan
# Checking the distribution
loan.groupby('Personal_Loan')['CCAvg'].describe()
| Personal_Loan | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 4520.0 | 1.729009 | 1.567647 | 0.0 | 0.6 | 1.4 | 2.3000 | 8.8 |
| 1 | 480.0 | 3.905354 | 2.097681 | 0.0 | 2.6 | 3.8 | 5.3475 | 10.0 |
#Figure size
plt.figure(figsize=(12, 6))
# Histogram and KDE plot for customers with personal loans
sns.histplot(loan[loan['Personal_Loan'] == 1]['CCAvg'], kde=True, color='blue', label='Has Personal Loan', bins=30)
# Histogram and KDE plot for customers without personal loans
sns.histplot(loan[loan['Personal_Loan'] == 0]['CCAvg'], kde=True, color='red', label='No Personal Loan', bins=30)
plt.xlabel('CCAvg')
plt.ylabel('Frequency')
plt.title('CCAvg Distribution: Customers With and Without Personal Loans')
plt.legend()
plt.show()
Without personal loan
With Personal Loan
# Checking the distribution of age for customers with and without a personal loan
loan.groupby('Personal_Loan')['Age'].describe()
| Personal_Loan | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 4520.0 | 45.367257 | 11.450427 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| 1 | 480.0 | 45.066667 | 11.590964 | 26.0 | 35.0 | 45.0 | 55.0 | 65.0 |
# Figure size
plt.figure(figsize=(12, 6))
# Histogram and KDE plot for customers with personal loans
sns.histplot(loan[loan['Personal_Loan'] == 1]['Age'], kde=True, color='blue', label='Has Personal Loan', bins=30)
# Histogram and KDE plot for customers without personal loans
sns.histplot(loan[loan['Personal_Loan'] == 0]['Age'], kde=True, color='red', label='No Personal Loan', bins=30)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution: Customers With and Without Personal Loans')
plt.legend()
plt.show()
# Create age bins: 23 -67 is the age range in the dataset
age_bins = [20, 30, 40, 50, 60, 70]
age_labels = ['20-29', '30-39', '40-49', '50-59', '60-69']
# Create a new column for age ranges
loan['Age_Range'] = pd.cut(loan['Age'], bins=age_bins, labels=age_labels, right=False)
# Plot the data using age ranges
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Age_Range')
plt.ylabel('Count')
plt.title('Personal Loan & Age Range')
plt.xticks(rotation=45)
plt.show()
#Figure size
plt.figure(figsize=(10, 6))
#Boxplot
sns.boxplot(data=loan, x='Personal_Loan', y='Age')
plt.ylabel('Age')
plt.title('Personal Loan & Age')
plt.xticks(rotation=45);
No outliers present.
Mean age is 45 for both groups.
Individuals aged 30 to 59 have the highest count of personal loans taken out (ages 30-39 are the highest).
Ages 20-29 have the lowest count of personal loans taken out, followed by the 60-69 range.
The boxplots show similarly distributed data for both groups, though those with personal loans have slightly shorter whiskers on each side.
Further insight
We can infer that ages 30-59 are when people have finished studying and settled into their careers, whereas many 20-29-year-olds are still students carrying student loans.
Ages 60-69 are closer to retirement age, with more focus on retirement options than on personal loans.
Domain knowledge: The Office of Federal Student Aid says California is the state with the most federal student loan debt. https://www.ppic.org/publication/student-loan-debt-in-california/
# Checking the distribution of income for customers with and without a personal loan
loan.groupby('Personal_Loan')['Income'].describe()
| Personal_Loan | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 4520.0 | 66.237389 | 40.578534 | 8.0 | 35.0 | 59.0 | 84.0 | 224.0 |
| 1 | 480.0 | 144.745833 | 31.584429 | 60.0 | 122.0 | 142.5 | 172.0 | 203.0 |
# Figure size
plt.figure(figsize=(12, 6))
# Histogram and KDE plot for customers with personal loans
sns.histplot(loan[loan['Personal_Loan'] == 1]['Income'], kde=True, color='blue', label='Has Personal Loan', bins=30)
# Histogram and KDE plot for customers without personal loans
sns.histplot(loan[loan['Personal_Loan'] == 0]['Income'], kde=True, color='red', label='No Personal Loan', bins=30)
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.title('Income Distribution: Customers With and Without Personal Loans')
plt.legend()
plt.show()
Individuals with personal loans tend to have a higher income than those who do not have a personal loan.
Without loans
With Loans
# Countplot - Personal Loan & Online
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Online')
plt.ylabel('Count')
plt.title('Personal loan & Online')
plt.xticks(rotation=45);
Further insight
#Countplot - Personal Loan & Credit Card
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='CreditCard')
plt.ylabel('Count')
plt.title('Personal loan & CreditCard')
plt.xticks(rotation=45);
Most customers do not hold a credit card from another banking institution, whether or not they have a personal loan.
Those without a personal loan have a higher count of credit cards issued by another bank.
Univariate analysis showed that 70% of customers do not use another bank's credit card; the remaining 30% who do are split comparably between the two groups.
Further insight
Customers without personal loans might hold other banks' credit cards because AllLife Bank may have capped the credit limit issued to them, prompting those individuals to request additional cards from other banks.
# Countplot - Personal Loan & CD Account
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='CD_Account')
plt.ylabel('Count')
plt.title('Personal loan & CD_Account')
plt.xticks(rotation=45);
Majority of customers do not have a CD_Account.
Individuals with a personal loan are more likely to have a CD_Account as well.
# Count - Personal Loan & Security Account
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Personal_Loan', hue='Securities_Account')
plt.ylabel('Count')
plt.title('Personal loan & Securities_Account')
plt.xticks(rotation=45);
# Countplot - CD_account & Family
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='Family')
plt.ylabel('Count')
plt.title('CD_Account & Family')
plt.xticks(rotation=45);
# Countplot - CD_account & Securities
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='Securities_Account')
plt.ylabel('Count')
plt.title('CD_Account & Security')
plt.xticks(rotation=45);
# Count - CD_account & Credit_Card
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='CreditCard')
plt.ylabel('Count')
plt.title('CD_Account & Credit_Card')
plt.xticks(rotation=45);
# Countplot CD_Account & Online
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='CD_Account', hue='Online')
plt.ylabel('Count')
plt.title('CD_Account & Online')
plt.xticks(rotation=45);
# Create age bins: 23 -67 is the age range in the dataset
age_bins = [20, 30, 40, 50, 60, 70]
age_labels = ['20-29', '30-39', '40-49', '50-59', '60-69']
# Create a new column for age ranges
loan['Age_Range'] = pd.cut(loan['Age'], bins=age_bins, labels=age_labels, right=False)
# Plot the data using age ranges
plt.figure(figsize=(12, 8))
sns.countplot(data=loan, x='Education', hue='Age_Range')
plt.ylabel('Count')
plt.title('Education & Age Range')
plt.xticks(rotation=45)
plt.show()
Undergrad = 1 / Graduate = 2 / Advanced/Professional = 3
Level 1
Level 2
Level 3
#Import Library
from scipy.stats import chi2_contingency
features = ['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
target = 'Personal_Loan'
# Calculate Pearson correlation for continuous variables
corr_matrix = loan[features + [target]].corr()
# Display correlation with the target variable
print("Correlation with Personal Loan:\n", corr_matrix[target].sort_values(ascending=False))
# Visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
# Calculate and display Chi-square test for categorical variables
def chi_square_test(cat_feature, target_feature):
    contingency_table = pd.crosstab(loan[cat_feature], loan[target_feature])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return p
categorical_features = ['Education', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
chi_square_results = {feature: chi_square_test(feature, target) for feature in categorical_features}
print("Chi-square test p-values:\n", chi_square_results)
Correlation with Personal Loan:
Personal_Loan         1.000000
Income                0.502462
CCAvg                 0.366889
CD_Account            0.316355
Mortgage              0.142095
Education             0.136722
Family                0.061367
Securities_Account    0.021954
Online                0.006278
CreditCard            0.002802
ZIPCode              -0.002974
Experience           -0.007413
Age                  -0.007726
Name: Personal_Loan, dtype: float64
Chi-square test p-values:
{'Education': 6.991473868665428e-25, 'Securities_Account': 0.14051497326319357, 'CD_Account': 7.398297503329848e-110, 'Online': 0.6928599643141484, 'CreditCard': 0.8843861223314504}
Continuous variables: Income (0.50), CCAvg (0.37), and CD_Account (0.32) show the strongest correlations with Personal_Loan.
Chi-square test, categorical variables: Education and CD_Account have p-values far below 0.05 and are significantly associated with Personal_Loan, while Securities_Account, Online, and CreditCard are not.
What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
How many customers have credit cards?
What are the attributes that have a strong correlation with the target attribute (personal loan)?
How does a customer's interest in purchasing a loan vary with their age?
There doesn't appear to be a clear pattern indicating that customers of a particular age group are more or less likely to purchase a loan. Both groups have a similar distribution of ages, with no significant outliers or trends.
The data shows the 30-59 age ranges taking out more personal loans than the 20-29 and 60-69 ranges.
How does a customer's interest in purchasing a loan vary with their education?
Overall, the data suggests that there might be a relationship between education level and the likelihood of having a personal loan, with customers having higher education levels being slightly more inclined to purchase a loan.
The data shows individuals with graduate or advanced levels of education having higher total counts of personal loans than the other education level.
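The per-level conversion rate behind this observation can be computed with a normalized crosstab (column 1 is the share of loan takers at each education level). A sketch on a hypothetical mini-sample mirroring the pattern, not the real data:

```python
import pandas as pd

# Hypothetical mini-sample: loan uptake rises with education level
df = pd.DataFrame({
    "Education":     [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Personal_Loan": [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})
rates = pd.crosstab(df["Education"], df["Personal_Loan"], normalize="index")
print(rates[1].round(2).to_dict())  # conversion rate per education level
```

The same one-liner on `loan` gives the actual conversion rate per education level.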
Reason for this approach:
K-means clustering can effectively segment customers based on selected features. This approach provides valuable insights into different customer groups, allowing for targeted marketing and personalized services.
K-means clustering is sensitive to outliers, so they are removed below for more accurate results.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy import stats
# Define features
features = ['Age', 'Income', 'Family', 'CreditCard', 'Education', 'Personal_Loan', 'CD_Account', 'Mortgage']
# Select the feature
X = loan[features]
# Scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Calculate Z-scores for each feature
z_scores = np.abs(X_scaled)
# Define threshold for outlier detection (e.g., Z-score > 3)
threshold = 3
# Create a mask to identify outliers
outlier_mask = (z_scores > threshold).any(axis=1)
# Remove outliers from the dataset
X_cleaned = X[~outlier_mask]
summary_stats_cleaned = X_cleaned.describe()
print(summary_stats_cleaned)
Age Income Family CreditCard Education \
count 4300.000000 4300.000000 4300.000000 4300.000000 4300.000000
mean 45.374651 65.060000 2.384419 0.272326 1.858140
std 11.453349 39.420602 1.153043 0.445208 0.839181
min 23.000000 8.000000 1.000000 0.000000 1.000000
25% 35.000000 35.000000 1.000000 0.000000 1.000000
50% 45.000000 59.000000 2.000000 0.000000 2.000000
75% 55.000000 84.000000 3.000000 1.000000 3.000000
max 67.000000 205.000000 4.000000 1.000000 3.000000
Personal_Loan CD_Account Mortgage
count 4300.0 4300.0 4300.000000
mean 0.0 0.0 46.135116
std 0.0 0.0 80.343277
min 0.0 0.0 0.000000
25% 0.0 0.0 0.000000
50% 0.0 0.0 0.000000
75% 0.0 0.0 91.250000
max 0.0 0.0 359.000000
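Note the cleaned summary: Personal_Loan and CD_Account now have a mean of 0. That is a side effect of applying a |z| > 3 filter to rare binary columns: with positive rate p, every 1 gets a z-score of sqrt((1-p)/p), which exceeds 3 whenever p < 0.1. A quick check (using p ≈ 0.096, the Personal_Loan rate from the earlier summary):

```python
import numpy as np

# For a 0/1 column with positive rate p, StandardScaler gives each 1 a
# z-score of (1 - p) / sqrt(p * (1 - p)) = sqrt((1 - p) / p)
p = 0.096  # Personal_Loan positive rate from loan.describe()
z_positive = np.sqrt((1 - p) / p)
print(round(z_positive, 2))  # 3.07 -> above the |z| > 3 threshold, so every 1 is dropped
```

So the filter removed all loan takers and all CD holders; excluding binary columns from the z-score mask would avoid this.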
# visualize original and cleaned data for each feature
fig, axes = plt.subplots(nrows=2, ncols=len(features), figsize=(15, 6))
for i, feature in enumerate(features):
    axes[0, i].hist(X[feature], bins=30, color='blue', alpha=0.5, label='Original')
    axes[0, i].set_title(feature + ' (Original)')
    axes[1, i].hist(X_cleaned[feature], bins=30, color='red', alpha=0.5, label='Cleaned')
    axes[1, i].set_title(feature + ' (Cleaned)')
plt.tight_layout()
plt.show()
# Determine how many clusters to use
from sklearn.cluster import KMeans
# Elbow Method
inertia = []
K = range(1, 11)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(X_cleaned)
    inertia.append(kmeans.inertia_)
#Plot graph
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k')
plt.show()
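As a complementary check on the elbow plot, the silhouette score can be compared across several values of k. The sketch below runs on synthetic blobs (`X_demo` is an illustrative stand-in, not `X_cleaned`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the cleaned data: three well-separated blobs
X_demo, _ = make_blobs(
    n_samples=300, centers=[[-6, -6], [0, 6], [6, -6]], cluster_std=0.8, random_state=0
)

scores = {}
for k in range(2, 7):  # silhouette is defined only for k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)  # expected to recover k = 3 here
```

The k with the highest silhouette score agrees with the elbow's bend when the clusters are well separated.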
Further analysis
#Prepare the Data (X_cleaned is the cleaned dataset not loan)
X_cleaned_scaled = scaler.fit_transform(X_cleaned) # Scale the cleaned dataset if necessary
#Elbow method showed 3
k = 3
# Apply K-means Clustering
kmeans = KMeans(n_clusters=k, random_state=42)
#Fit the Model
kmeans.fit(X_cleaned_scaled)
# Predict Clusters
cluster_labels = kmeans.predict(X_cleaned_scaled)
# Visualize Results
plt.scatter(X_cleaned_scaled[:, 0], X_cleaned_scaled[:, 1], c=cluster_labels, cmap='viridis')
plt.xlabel(features[0])
plt.ylabel(features[1])
plt.title('K-means Clustering Results')
plt.colorbar(label='Cluster')
plt.show()
# use X_cleaned (not loan, as error will occur as loan has 5000 entries)
X_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4300 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Age            4300 non-null   int64
 1   Income         4300 non-null   int64
 2   Family         4300 non-null   int64
 3   CreditCard     4300 non-null   int64
 4   Education      4300 non-null   int64
 5   Personal_Loan  4300 non-null   int64
 6   CD_Account     4300 non-null   int64
 7   Mortgage       4300 non-null   int64
dtypes: int64(8)
memory usage: 302.3 KB
# Perform k-means clustering
kmeans = KMeans(n_clusters=k, random_state=42)
cluster_labels = kmeans.fit_predict(X_cleaned)
# Add cluster labels to the DataFrame (copy first: X_cleaned is a slice of X,
# so assigning to it directly would raise a SettingWithCopyWarning)
X_cleaned = X_cleaned.copy()
X_cleaned['Cluster'] = cluster_labels
# Group by cluster labels and calculate the mean of numeric features
cluster_analysis = X_cleaned.groupby('Cluster').mean(numeric_only=True)
# Print the cluster analysis
print(cluster_analysis)
Age Income Family CreditCard Education Personal_Loan \
Cluster
0 45.508697 66.992780 2.380374 0.276666 1.852314 0.0
1 45.281324 46.970449 2.453901 0.268322 1.943262 0.0
2 44.565111 88.191646 2.270270 0.248157 1.724816 0.0
CD_Account Mortgage
Cluster
0 0.0 0.024943
1 0.0 119.742317
2 0.0 238.336609
Each cluster represents a distinct group of customers with characteristic income, family size, banking product usage, and financial habits. Note that Personal_Loan and CD_Account are constant (all 0) in every cluster: because these are rare binary features, every row with a value of 1 had a z-score above 3 and was dropped by the outlier filter. Binary features are generally better excluded from z-score outlier removal.
Cluster 0: Represents individuals with moderate income, small families, low mortgage amounts, and no personal loans or CD accounts.
Cluster 1: Represents individuals with lower income compared to Cluster 0, slightly larger families, and higher mortgage amounts.
Cluster 2: Represents individuals with higher income, smaller families, and significantly higher mortgage amounts compared to the other clusters.
# Pull up data for review
loan.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Age_Range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 20-29 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 40-49 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 30-39 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 30-39 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 30-39 |
# Check missing values
loan.isnull().sum()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
Age_Range             0
dtype: int64
# Outlier detection using boxplots
numeric_columns = loan.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(loan[variable], whis=1.5)  # use loan, not an undefined `data`
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Personal loan as target variable
X = loan.drop(["Personal_Loan"], axis=1)
Y = loan["Personal_Loan"]
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
# Print result
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
Percentage of classes in training set:
Personal_Loan
0    0.905429
1    0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.900667
1    0.099333
Name: proportion, dtype: float64
# Create model 0
model0 = DecisionTreeClassifier(random_state=1)
model0.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
Because personal-loan datasets are typically imbalanced (far fewer customers take loans than not), a combination of precision, recall, and F1-score is appropriate for evaluating the decision tree; the primary focus here will be on recall.
Why Decision Tree?
The model can make two kinds of wrong predictions:
False Negative (FN): Predicting that a customer will not take a personal loan, but in reality, the customer does take a loan.
False Positive (FP): Predicting that a customer will take a personal loan, but in reality, the customer does not take a loan.
Which case is more important?
False Negative (FN): Predicting that a customer will not take a personal loan, but in reality, the customer takes a loan.
Consequence: The bank misses out on targeting potential loan customers, leading to a loss of potential revenue.
False Positive (FP): Predicting that a customer will take a personal loan, but in reality, the customer does not take a loan.
Consequence: The bank may incur additional marketing costs by targeting customers who are not interested in loans, leading to inefficient allocation of marketing resources.
How to reduce the losses?
Maximize Recall: To minimize the loss from missed opportunities (False Negatives), the bank should focus on maximizing recall. Greater recall increases the chances of identifying all potential loan customers, thus capturing more revenue opportunities.
Balance Precision and Recall: While maximizing recall, it is also important to consider precision to avoid excessive marketing costs. Therefore, a good balance between recall and precision is necessary to optimize both marketing efficiency and revenue capture.
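The trade-off above can be made concrete with a toy confusion matrix (all counts below are illustrative, not model output):

```python
# Illustrative counts: 90 true negatives, 10 false positives,
# 5 false negatives, 45 true positives
tn, fp, fn, tp = 90, 10, 5, 45

recall = tp / (tp + fn)     # share of actual loan takers we caught: 0.9
precision = tp / (tp + fp)  # share of flagged customers who convert: ~0.82

# Pushing FN down (higher recall) typically raises FP and lowers precision;
# here a missed loan customer costs more than a wasted marketing contact.
```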
# Defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # Predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with counts and percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
confusion_matrix_sklearn(model0, X_train, y_train)
decision_tree_perf_train_without = model_performance_classification_sklearn(
model0, X_train, y_train
)
decision_tree_perf_train_without
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(model0, X_test, y_test)
decision_tree_perf_test_without = model_performance_classification_sklearn(
model0, X_test, y_test
)
decision_tree_perf_test_without
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.978 | 0.90604 | 0.876623 | 0.891089 |
The baseline decision tree's test results are promising, with high accuracy, recall, precision, and F1 score.
Precision changed the most from training to test, dropping from 1.0 to about 0.88.
Accuracy remains the highest metric, followed by recall. The perfect training scores indicate the unpruned tree has memorized the training data.
If class A occurs 10% of the time and class B 90%, class B dominates and the decision tree becomes biased toward the dominant class.
In this case, we set class_weight = "balanced", which automatically adjusts the weights to be inversely proportional to the class frequencies in the input data.
class_weight is a hyperparameter of DecisionTreeClassifier.
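For reference, scikit-learn computes "balanced" weights as n_samples / (n_classes * bincount(y)); a quick check with a 90/10 split like this dataset's (counts illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# A 90/10 split mirroring Personal_Loan's imbalance (counts illustrative)
y_demo = np.array([0] * 90 + [1] * 10)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
# Formula: n_samples / (n_classes * bincount(y)) -> [100/180, 100/20]
```

The minority class gets roughly 9× the weight of the majority class, which is what counteracts the tree's bias toward the dominant class.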
# Create model 1 with weight class
model = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=1)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.977333 | 0.872483 | 0.896552 | 0.884354 |
With balanced class weights, the model takes a more general approach, shifting the metrics.
Accuracy, recall, precision, and F1 are all lower on the test set than on the training set, though still high.
The perfect training scores again suggest overfitting.
Why grid search?
Hyperparameter tuning with grid search systematically tries combinations of tree constraints (depth, leaf count, split size) under cross-validation, optimizing the decision tree's ability to correctly identify customers who will accept a personal loan.
# Choose classifier
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"class_weight": [None, "balanced"],
"max_depth": np.arange(2, 7, 2), # [2, 4, 6]
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Recall scoring used to compare parameter combinations
recall_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=2, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.790286 | 1.0 | 0.310798 | 0.474212 |
The tuned model's precision and F1 have decreased sharply compared to the class-weighted model.
Recall remains perfect at 1.0.
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.779333 | 1.0 | 0.310417 | 0.473768 |
The tuned model achieves a perfect recall of 1.0 on both the training and test sets, showing it generalizes well for the positive class on unseen data.
Accuracy, precision, and F1 change little between the train and test sets.
# Get features from model
feature_names = list(X_train.columns)
importances = estimator.feature_importances_
indices = np.argsort(importances)
# Visualize decision tree branches
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [64.61, 79.31] class: 1
|--- Income >  92.50
|   |--- Education <= 1.50
|   |   |--- weights: [272.80, 306.65] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [67.92, 1364.05] class: 1
Summary: high-income customers, especially those with graduate or advanced education, are the most promising group for personal loans and should be prioritized for targeted marketing.
Income ≤ 92.50: only customers with high credit card spending (CCAvg > 2.95) lean toward taking a loan.
Income > 92.50: customers are likely loan takers at every education level, most strongly when Education > 1 (graduate or advanced).
# Get feature importances from model
importances = estimator.feature_importances_
importances
array([0. , 0. , 0. , 0.82007181, 0. ,
0. , 0.06262835, 0.11729984, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ])
# Importance of features in the tree building
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Why cost-complexity pruning?
Post-pruning first grows the full tree and then removes its weakest links: for each value of the complexity parameter ccp_alpha, subtrees whose impurity reduction does not justify their size are collapsed, trading training fit for better generalization.
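A minimal sketch of the effect on synthetic data (illustrative, not the loan dataset): as ccp_alpha grows, the weakest subtrees are collapsed, so the tree can only shrink.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data as a stand-in for the training set
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

node_counts = []
for alpha in [0.0, 0.005, 0.02]:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
    node_counts.append(t.tree_.node_count)
# Minimal cost-complexity pruning is nested: raising alpha never grows the tree
```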
# total impurity of the leaves in a decision tree
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced") # {0: 0.15, 1: 0.85}
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000e+00 | -7.832774e-15 |
| 1 | 3.853725e-19 | -7.832388e-15 |
| 2 | 4.729571e-19 | -7.831915e-15 |
| 3 | 4.729571e-19 | -7.831442e-15 |
| 4 | 7.707449e-19 | -7.830671e-15 |
| 5 | 1.051016e-18 | -7.829620e-15 |
| 6 | 1.261219e-18 | -7.828359e-15 |
| 7 | 8.338059e-18 | -7.820021e-15 |
| 8 | 1.257806e-17 | -7.807443e-15 |
| 9 | 1.574681e-04 | 3.149363e-04 |
| 10 | 2.857143e-04 | 8.863649e-04 |
| 11 | 3.083987e-04 | 1.503162e-03 |
| 12 | 3.116508e-04 | 2.438115e-03 |
| 13 | 3.130853e-04 | 3.064285e-03 |
| 14 | 3.604006e-04 | 4.505887e-03 |
| 15 | 3.628623e-04 | 5.957336e-03 |
| 16 | 3.797005e-04 | 7.476138e-03 |
| 17 | 5.220569e-04 | 7.998195e-03 |
| 18 | 5.375794e-04 | 8.535775e-03 |
| 19 | 5.880239e-04 | 9.711822e-03 |
| 20 | 7.689471e-04 | 1.048077e-02 |
| 21 | 1.003878e-03 | 1.148465e-02 |
| 22 | 1.213013e-03 | 1.391067e-02 |
| 23 | 1.343845e-03 | 1.525452e-02 |
| 24 | 1.416204e-03 | 1.667072e-02 |
| 25 | 1.431094e-03 | 1.953291e-02 |
| 26 | 1.693744e-03 | 2.292040e-02 |
| 27 | 1.981730e-03 | 2.688386e-02 |
| 28 | 2.150414e-03 | 2.903427e-02 |
| 29 | 2.375809e-03 | 3.141008e-02 |
| 30 | 3.344493e-03 | 3.475457e-02 |
| 31 | 3.602932e-03 | 4.196044e-02 |
| 32 | 3.729690e-03 | 4.569013e-02 |
| 33 | 4.920880e-03 | 5.061101e-02 |
| 34 | 1.007808e-02 | 7.076717e-02 |
| 35 | 2.255792e-02 | 9.332509e-02 |
| 36 | 5.564782e-02 | 2.046207e-01 |
| 37 | 2.953793e-01 | 5.000000e-01 |
# Figure size
fig, ax = plt.subplots(figsize=(10, 5))
# PLot graph
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, a tree is fit for every candidate ccp_alpha so that tree size, depth, and recall can be compared along the pruning path.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2953792759992323
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    recall_train.append(recall_score(y_train, pred_train))

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    recall_test.append(recall_score(y_test, pred_test))
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
# Compare alphas & Recall - train & test
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
The curves show how recall changes with alpha on the training and test sets.
Both stay high and track each other closely, indicating the pruned models generalize well to unseen data.
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.002375808619774645, class_weight='balanced',
random_state=1)
# Create best_model
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
best_model, X_train, y_train
)
decision_tree_post_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.956857 | 1.0 | 0.686722 | 0.814268 |
confusion_matrix_sklearn(best_model, X_test, y_test)
decision_tree_post_test = model_performance_classification_sklearn(
best_model, X_test, y_test
)
decision_tree_post_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.948667 | 0.993289 | 0.660714 | 0.793566 |
# Visualize decision tree branches
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1344.67, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- weights: [41.42, 52.87] class: 1
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [23.19, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [0.00, 26.44] class: 1
|--- Income >  92.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 103.50
|   |   |   |   |--- CCAvg <= 3.21
|   |   |   |   |   |--- weights: [22.09, 0.00] class: 0
|   |   |   |   |--- CCAvg >  3.21
|   |   |   |   |   |--- weights: [2.76, 15.86] class: 1
|   |   |   |--- Income >  103.50
|   |   |   |   |--- weights: [239.11, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [8.84, 290.79] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.85
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [37.55, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- weights: [23.75, 37.01] class: 1
|   |   |   |--- CCAvg >  2.85
|   |   |   |   |--- weights: [6.63, 153.32] class: 1
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 1173.72] class: 1
Income and credit card average spend (CCAvg):
- Customers with lower income and lower CCAvg are unlikely to take personal loans.
Education and family size:
- Among higher-income customers with undergraduate education (Education ≤ 1.5), larger families (Family > 2.5) are much more likely to take a loan; smaller families do so only in a narrow income band with high CCAvg.
- Higher-income customers with graduate or advanced education are likely loan takers, especially with Income > 116.5 or moderate-to-high CCAvg.
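The strongest rule can be turned into a pandas mask to size and score a target segment; the tiny frame below is illustrative (column names follow the dataset, values are made up):

```python
import pandas as pd

# Tiny illustrative frame with the dataset's column names; values are made up
demo = pd.DataFrame({
    "Income":        [120, 95, 60, 150, 80],
    "Education":     [3,   1,  2,  2,   1],
    "Personal_Loan": [1,   0,  0,  1,   0],
})

# Strongest rule from the pruned tree: Income > 116.5 and Education > 1.5
segment = demo[(demo["Income"] > 116.5) & (demo["Education"] > 1.5)]
conversion = segment["Personal_Loan"].mean()
```

Applied to the real `loan` frame, the same mask would give the segment's size and historical conversion rate for campaign planning.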
# Importance of features in the tree building
importances = best_model.feature_importances_
indices = np.argsort(importances)
# Importance of features in the tree building
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Income remains the dominant feature in both the pre- and post-pruned trees.
The post-pruned tree draws on more features: Income, Family, Education, CCAvg, and CD_Account.
# Training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train_without.T,
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.790286 | 0.956857 |
| Recall | 1.0 | 1.0 | 1.000000 | 1.000000 |
| Precision | 1.0 | 1.0 | 0.310798 | 0.686722 |
| F1 | 1.0 | 1.0 | 0.474212 | 0.814268 |
# Testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test_without.T,
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|---|
| Accuracy | 0.978000 | 0.977333 | 0.779333 | 0.948667 |
| Recall | 0.906040 | 0.872483 | 1.000000 | 0.993289 |
| Precision | 0.876623 | 0.896552 | 0.310417 | 0.660714 |
| F1 | 0.891089 | 0.884354 | 0.473768 | 0.793566 |
Overall, the decision tree with post-pruning seems to strike the best balance between model complexity and generalization performance, achieving high accuracy and recall on both the training and test sets.
By focusing on recall for predicting personal-loan uptake, post-pruning ensures that most customers who would take a loan are correctly identified, minimizing the risk of missing potential loan takers while keeping false positives reasonably low.
Analyzed the dataset to uncover patterns and insights, focusing on identifying key indicators of whether a customer is likely to take a personal loan in the next campaign. Employed standard Exploratory Data Analysis (EDA) techniques, followed by K-means clustering and decision tree models to pinpoint characteristics of potential customers.
Overall, the majority of AllLife Bank customers in this dataset did not accept a personal loan in the last campaign.
Most important features that predict if someone will accept the personal loan are: Income, Education, Family and CCAvg spending.
There are potential customers the bank can target to increase the conversion rate in the next campaign. Based on the pruned tree, an ideal customer to target has: income above roughly $92K (strongest above $116K), graduate or advanced education, a larger family (3+) when education is undergraduate, and above-average monthly credit card spending.
Consider the following
ZIP codes provide valuable insight into location and local economic conditions. Analyzing different locations can further segment customers, allowing the bank to target them better in future campaigns.
More data on online services: when, why and how long. Can be used to discover pain points for customers.
How long they have been a customer/What year did they join. Who is most likely to take out a loan based on when they joined the bank could be helpful for future campaigns.
Summary Strategy
By understanding the unique characteristics and preferences of each cluster, the bank can develop targeted marketing strategies that resonate with each segment, drive engagement, and ultimately increase customer satisfaction and loyalty.
Marketing Strategies for Each Cluster
Cluster 0 (moderate income, low mortgage):
Targeted Messaging: Craft messaging that emphasizes financial stability and responsible spending.
Product Offerings: Offer low-interest credit cards or savings accounts to encourage saving and responsible credit card usage. Provide educational resources on financial planning and budgeting.
Promotions: Offer promotions or rewards for opening new savings or credit card accounts.
Personalization: Use personalized campaigns based on each customer's specific financial needs and preferences.
Cluster 1 (lower income, larger families, higher mortgage):
Financial Education: Provide resources and workshops on managing finances, focusing on budgeting and debt management.
Mortgage Services: Offer mortgage refinancing options or home equity loans with attractive rates to assist with homeownership goals.
Credit Building Products: Provide credit-building products or services to help improve credit scores and qualify for better financial opportunities.
Family-Oriented Promotions: Create family-oriented promotions or events to appeal to their slightly larger family size.
Cluster 2 (higher income, highest mortgage):
Affluent Lifestyle: Highlight exclusive or premium banking services, such as concierge banking or wealth management, tailored to their higher income level.
Investment Opportunities: Offer investment products or portfolio management services to grow their wealth.
Luxury Rewards: Provide luxury rewards or perks for high-value clients, such as exclusive access to events or travel benefits.
Personalized Financial Planning: Offer personalized financial planning services to help them achieve long-term financial goals.